Airbnb has not only changed how people travel and live, but has also created new business potential. We are interested in exploring the data generated by Airbnb and analyzing it to find interesting facts about Airbnb listings in New York. We hope to generate useful insights that offer guidance to customers and business suggestions to hosts.
Data source: http://insideairbnb.com/get-the-data.html.
The data source provides datasets of Airbnb listing information. We use the most recent (September 2019) dataset for New York. The data is not cleaned, so we need to spend some time tidying it.
The first obstacle is that the data file is relatively large: the downloaded CSV file is more than 180 MB. We therefore select only the columns relevant to our analysis. We also exclude listings without reviews, as we want to focus on active listings.
Another obstacle is that some information is compressed into a single cell, as with amenities. The amenities column packs all the amenities of a listing into one long string, and the items are not separated by a simple delimiter. We have tried to transform amenities into a separate tidy data frame, and we will try to clean other similar columns.
Additionally, we define an active listing as one that received a review within the past 12 months. We also include only listings with a price greater than zero, since a price of zero is likely due to an error in data collection.
For the cleaning fee (cleaning_fee) and security deposit (security_deposit), it is intuitive to replace missing values with 0. A small number of listings have missing values in other variables; we simply exclude these listings, as the missing values may also be due to collection errors in the original data.
Finally, there are 28098 listings and 62 variables for our analysis.
amenities for backup
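# split amenities into a tidy data frame
library(tidyverse)
## helper: turn one raw amenities string into a "|"-separated string
split_amenities <- function(x) {
  # x is a single input string, e.g. {TV,"Cable TV",Wifi}
  x <- str_replace_all(x, "[{}]", ",") # turn the braces into commas
  temp <- str_split(x, "\\\"")[[1]] # split on double quotes
  # pieces that start or end with a comma are treated as unquoted runs: commas become "|"
  unquoted <- str_starts(temp, "[,]") | str_ends(temp, "[,]")
  temp[unquoted] <- str_replace_all(temp[unquoted], "[,]", "|")
  temp <- str_remove(temp, "^\\|") # drop a leading "|"
  temp <- str_remove(temp, "\\|$") # drop a trailing "|"
  temp <- temp[temp != ""]
  out <- paste(temp, collapse = "|")
  return(out)
}
# 38 listings have no amenities
# dat1 is assumed to be the review-filtered listings (data_cleaned in the import code)
dat2 <- dat1 %>%
  select(id, amenities) %>%
  rowwise() %>%
  mutate(amenity = split_amenities(amenities)) %>%
  ungroup() # drop the rowwise grouping before reshaping
dat3 <- dat2 %>% separate_rows(amenity, sep = "\\|")
## output amenities
# write_csv(dat3 %>% select(-amenities), "./data/amenities_201909.csv")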
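As a compact cross-check on split_amenities, the same parsing can be sketched with a single regular expression in base R. This is only an illustration: the helper name extract_amenities and the example string are hypothetical, and the sketch assumes the raw format is a brace-wrapped, comma-separated list in which items containing spaces or punctuation are double-quoted.

```r
# Hypothetical helper (not part of the original analysis): extract the
# individual amenities from one raw string such as {TV,"Cable TV",Wifi}.
extract_amenities <- function(x) {
  # match either a double-quoted item or a run of characters that
  # contains no comma, brace, or double quote
  items <- regmatches(x, gregexpr('"[^"]*"|[^,{}"]+', x, perl = TRUE))[[1]]
  # strip the quotes that surround quoted items
  gsub('"', "", items, fixed = TRUE)
}

# Made-up example string in the assumed raw format
example <- '{TV,"Cable TV",Wifi,"Air conditioning"}'
amenity_items <- extract_amenities(example)
print(amenity_items)
```

An empty list such as {} yields a zero-length character vector, so listings without amenities would simply contribute no rows.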
The majority of listings in our data are in Manhattan and Brooklyn. The closer to downtown New York City, the denser the listings.
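## map for listings
library(ggmap) # get_map() with source = "google" needs an API key set via register_google()
# airbnb_cleaned: the cleaned listings (presumably data_cleaned.csv from the import code)
data <- airbnb_cleaned
data_map_all <- data %>% dplyr::select(id, longitude, latitude)
# bounding box around all listings, slightly padded
sbbox <- make_bbox(lon = data_map_all$longitude, lat = data_map_all$latitude, f = .001)
ny_map_all <- get_map(location = sbbox, maptype = "satellite", source = "google")
ggmap(ny_map_all) +
  geom_point(data = data_map_all, mapping = aes(x = longitude, y = latitude),
             color = "red", size = 0.0011, alpha = 0.6)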
Booking price is an important factor that both customers and hosts care about. In this section, we explore Airbnb booking prices in the New York market.
[Figure: Distribution of Price]
This plot shows the overall distribution of price. It is right-skewed with a very long tail, so we apply a log10 transformation to visualize it better. The distribution shows that the median price is $100.
Given the price distribution above, we are interested in how prices vary by location. The pushpin map below shows the price distribution across New York districts.
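The plotting code for this figure is not included in the report, so the following is only a minimal sketch of how a log10-scaled price histogram could be drawn with ggplot2. It runs on simulated prices (the toy data frame is hypothetical) rather than on airbnb_cleaned, and the bin count is an arbitrary choice.

```r
library(ggplot2)

# Simulated, right-skewed prices standing in for airbnb_cleaned$price
set.seed(1)
toy <- data.frame(price = 10^rnorm(1000, mean = 2, sd = 0.3))

# Histogram with a log10-scaled x axis; the dashed line marks the median
p <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 40) +
  scale_x_log10() +
  geom_vline(xintercept = median(toy$price), linetype = "dashed") +
  labs(title = "Distribution of Price", x = "price (log10 scale)", y = "count")
print(p)
```

Scaling the axis with scale_x_log10() keeps the tick labels in dollars, which is usually easier to read than plotting log10(price) directly.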
## map for different price
# bin prices into four groups; case_when() uses the first condition that matches
ny_price <- data %>% mutate(price_group =
  case_when(price <= 100 ~ '<=$100',
            price <= 250 ~ '$100-$250',
            price <= 500 ~ '$250-$500',
            TRUE ~ '>$500'))
data_map_price <- ny_price %>% select(id, longitude, latitude, price_group)
sbbox3 <- make_bbox(lon = data_map_price$longitude, lat = data_map_price$latitude, f = .001)
ny_map_price <- get_map(location = sbbox3, maptype = "satellite", source = "google")
ggmap(ny_map_price) +
  geom_point(data = data_map_price, mapping = aes(x = longitude, y = latitude, color = price_group),
             size = 1, alpha = 0.5)
Most listings fall in the $100-$250 per night range, and the closer a listing is to downtown New York, the higher its price. The purple points indicate expensive listings above $250 per night. Listings that cost more than $500 per night are rare.
Indeed, price is largely affected by location. To illustrate, we plotted the average price per night for the five neighbourhood groups (boroughs): the Bronx, Staten Island, and Queens are almost the same, while Brooklyn and Manhattan are more expensive by about $50 to $100 per night. That is a large difference, considering that the majority of prices are around $100 per night.
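The reading of the map can be backed by a simple tabulation of how many listings fall in each price band. The sketch below uses a few made-up prices (the toy tibble is hypothetical, not the real listings) and the same cut points as the map; on the real data, data or airbnb_cleaned would take the place of toy.

```r
library(dplyr)

# Made-up nightly prices for illustration only
toy <- tibble(price = c(45, 80, 100, 120, 199, 250, 300, 650))

# Assign each listing to a price band, then count listings and shares per band
price_counts <- toy %>%
  mutate(price_group = case_when(
    price <= 100 ~ "<=$100",
    price <= 250 ~ "$100-$250",
    price <= 500 ~ "$250-$500",
    TRUE         ~ ">$500"
  )) %>%
  count(price_group) %>%
  mutate(share = n / sum(n))
print(price_counts)
```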
#average price for different area
avg_airbnb <- airbnb_cleaned %>%
group_by(neighbourhood_group_cleansed) %>%
summarize(avg_price = mean(price)) %>%
arrange(desc(avg_price))
colnames(avg_airbnb) <- c("neighbourhood","avg_price" )
ggplot(avg_airbnb) +
geom_bar(aes(x = reorder(neighbourhood, avg_price), y = avg_price),stat="identity") +
xlab("Neighbourhood") +
ggtitle("Average price for different area ")
#Price distribution of different neighborhood
ggplot(airbnb_cleaned, aes(x= price, color = neighbourhood_group_cleansed)) + geom_density() +
ggtitle("Price distribution in different neighbourhoods")
### What Makes Some Listings Most Expensive?
high_price <- airbnb_cleaned %>%
filter(price > 600)
high_price %>%
group_by(neighbourhood_group_cleansed) %>%
summarise(counts = n())
## # A tibble: 4 x 2
## neighbourhood_group_cleansed counts
## <chr> <int>
## 1 Bronx 1
## 2 Brooklyn 41
## 3 Manhattan 155
## 4 Queens 7
high_price %>%
group_by(neighbourhood_cleansed) %>%
summarise(counts = n())%>%
arrange(desc(counts))
## # A tibble: 46 x 2
## neighbourhood_cleansed counts
## <chr> <int>
## 1 Midtown 20
## 2 Upper East Side 12
## 3 West Village 12
## 4 Hell's Kitchen 11
## 5 SoHo 11
## 6 East Village 9
## 7 Upper West Side 9
## 8 Chelsea 8
## 9 Harlem 7
## 10 Kips Bay 7
## # … with 36 more rows
high_price %>%
group_by(room_type) %>%
summarise(counts = n())%>%
arrange(desc(counts))
## # A tibble: 3 x 2
## room_type counts
## <chr> <int>
## 1 Entire home/apt 182
## 2 Private room 13
## 3 Hotel room 9
high_price %>%
group_by(bedrooms) %>%
summarise(counts = n()) %>%
arrange(desc(counts))
## # A tibble: 10 x 2
## bedrooms counts
## <dbl> <int>
## 1 3 64
## 2 2 46
## 3 4 40
## 4 1 19
## 5 0 12
## 6 5 11
## 7 7 5
## 8 6 3
## 9 8 2
## 10 9 2
#Price distribution of different room type
ggplot(airbnb_cleaned, aes(x= price, color = room_type)) + geom_density() +
ggtitle("Price distribution of different room type")
avg_airbnb <- airbnb_cleaned %>%
group_by(neighbourhood_group_cleansed, room_type) %>%
summarize(avg_price = mean(price)) %>%
arrange(neighbourhood_group_cleansed)
ggplot(avg_airbnb, aes(x = reorder(neighbourhood_group_cleansed, avg_price), y = avg_price, fill = room_type)) +
geom_bar(stat="identity",position = "dodge") + ggtitle("Average price for different neighborhoods with room type")+
xlab("neighbourhood")
Surprisingly, in an area as expensive as Manhattan, many listings are “Entire home/apt”. Manhattan does have the most expensive listings, but they are also of relatively good quality (rather than being all shared rooms and small private rooms).
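One way to quantify how common each room type is within a borough is to compute within-group shares. The following is a minimal sketch on a small made-up table (the toy tibble is hypothetical); it assumes the column names neighbourhood_group_cleansed and room_type used elsewhere in this report.

```r
library(dplyr)

# Made-up listings for illustration only
toy <- tibble(
  neighbourhood_group_cleansed = c("Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Private room", "Shared room")
)

# Count listings per borough and room type, then convert the counts
# to shares within each borough
room_share <- toy %>%
  count(neighbourhood_group_cleansed, room_type) %>%
  group_by(neighbourhood_group_cleansed) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
print(room_share)
```

Applied to airbnb_cleaned instead of toy, the share column would show directly what fraction of Manhattan listings are entire homes or apartments.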
## map for different room types
# shuffle the row order (presumably so that no room type is systematically drawn on top)
data_map_room <- data[sample(nrow(data)), ] %>% select(id, longitude, latitude, room_type)
sbbox1 <- make_bbox(lon = data_map_room$longitude, lat = data_map_room$latitude, f = .001)
ny_map_room <- get_map(location = sbbox1, maptype = "satellite", source = "google")
ggmap(ny_map_room) +
geom_point(data = data_map_room, mapping = aes(x = longitude, y = latitude, color = room_type), size = 1, alpha = 0.5)
# Top average price
avg_airbnb <- airbnb_cleaned %>%
  group_by(neighbourhood_cleansed) %>% # refer to the column directly inside the pipe
  summarize(avg_price = mean(price)) %>%
  arrange(desc(avg_price))
colnames(avg_airbnb) <- c("neighbourhood", "avg_price")
ggplot(avg_airbnb[1:20, ]) +
  geom_bar(aes(x = reorder(neighbourhood, avg_price), y = avg_price), stat = "identity") +
  coord_flip() + ggtitle("Top 20 average price ")
#Top average price(different room type)
neigh_list <- dplyr::pull(avg_airbnb[1:10,1])
unique(airbnb_cleaned$room_type)
## [1] "Entire home/apt" "Private room" "Shared room" "Hotel room"
temp <- airbnb_cleaned %>%
group_by(neighbourhood_cleansed, room_type) %>%
summarize(avg_price = mean(price))
temp %>% filter(room_type == "Private room") %>% arrange(desc(avg_price))
## # A tibble: 202 x 3
## # Groups: neighbourhood_cleansed [202]
## neighbourhood_cleansed room_type avg_price
## <chr> <chr> <dbl>
## 1 Bay Terrace Private room 265
## 2 Breezy Point Private room 195
## 3 Belle Harbor Private room 178.
## 4 Theater District Private room 175.
## 5 Midtown Private room 171.
## 6 Tribeca Private room 151.
## 7 NoHo Private room 139.
## 8 West Village Private room 138.
## 9 Murray Hill Private room 136
## 10 SoHo Private room 132.
## # … with 192 more rows
#Distribution of property type
temp <- airbnb_cleaned %>%
group_by(property_type) %>%
summarise(counts = n())%>%
arrange(desc(counts)) %>%
filter(counts > 5)
ggplot(temp) +
geom_bar(aes(x = reorder(property_type, counts),y = counts), stat = 'identity') + coord_flip() +
ggtitle("distribution of property type ")
# Data importing and cleaning
library(tidyverse)
data <- read_csv("listings_201909.csv", na = c("", "NA", "N/A"))
##select columns
dat<-data%>%
select(-c(scrape_id:xl_picture_url))%>%
select(-host_url, -host_name, -host_location, -host_about,
-host_acceptance_rate, -host_thumbnail_url, -host_picture_url,
-host_neighbourhood)%>%
select(-street,-city,-state, -market, -smart_location, -country,
-country_code)%>%
select(-jurisdiction_names, -license,-weekly_price,-monthly_price,
-square_feet)%>%
select(-c(calendar_updated:calendar_last_scraped))
#dim(dat) #48377 62
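## modify data type: strip "%" and "$", then convert to numeric
## note: if the raw values contain thousands separators (e.g. "1,200.00"),
## as.numeric() returns NA for them; adding "," to the pattern would keep them
dat <- dat %>%
  mutate_at(c("host_response_rate", "extra_people",
              "price", "security_deposit", "cleaning_fee"),
            str_remove_all, pattern = "[%$]") %>%
  mutate_at(c("host_response_rate", "extra_people",
              "price", "security_deposit", "cleaning_fee"),
            as.numeric)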
## select listings which have reviews within the last 12 months
data_cleaned <- dat %>%
  filter(!is.na(first_review)) %>%
  filter(last_review >= '2018-09-01')
# dim(data_cleaned) # 28105 62
# clean NA
## replace missing fees and deposits with 0
data_cleaned$cleaning_fee[is.na(data_cleaned$cleaning_fee)] <- 0
data_cleaned$security_deposit[is.na(data_cleaned$security_deposit)] <- 0
## keep only listings with a positive price
data_cleaned <- data_cleaned %>% filter(price > 0)
## exclude rows with missing values in the given columns
completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
data_cleaned <- completeFun(data_cleaned, c(
  "review_scores_value", "review_scores_checkin", "review_scores_accuracy",
  "review_scores_communication", "review_scores_cleanliness", "review_scores_rating",
  "neighbourhood", "review_scores_location", "price", "bedrooms", "beds",
  "bathrooms", "host_identity_verified", "zipcode"))
# sort(colSums(is.na(data_cleaned)), decreasing = TRUE)
dim(data_cleaned) # 28098
#cleaned data output
##write_csv(data_cleaned,"data_cleaned.csv")